Monitoring, Observability & Infrastructure Interview Guide
Prometheus (25 Questions)
Core Concepts
- What is Prometheus? Key features
- What is the Prometheus architecture? Components (Prometheus Server, Pushgateway, Alertmanager, Exporters)
- What is a time-series database? How does Prometheus store data?
- What is the difference between monitoring and observability?
- What are the four golden signals of monitoring? (Latency, Traffic, Errors, Saturation)
Metrics & Data Model
- What are metrics in Prometheus? Types of metrics:
- Counter
- Gauge
- Histogram
- Summary
- What is a metric label? How to use labels effectively?
- What is cardinality? Why is high cardinality problematic?
- What is the difference between histogram and summary?
- When to use Counter vs Gauge?
- What is metric naming convention in Prometheus?
- What is the data retention period in Prometheus?
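For the histogram versus summary question above, it can help to see how a quantile is estimated from histogram buckets at query time. The following is a minimal sketch of that idea (linear interpolation inside cumulative buckets), not Prometheus's actual implementation; the function name `estimate_quantile` and the sample bucket values are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# One bucket: (upper bound, i.e. the "le" label, cumulative observation count).
Bucket = Tuple[float, float]


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count > prev_count and count >= rank:
            if math.isinf(bound):
                # Target lies in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets, in seconds.
latency_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.5, latency_buckets))
print(estimate_quantile(0.99, latency_buckets))
```

Because the estimate depends on bucket boundaries, its accuracy is limited by how well the buckets match the real latency distribution. A summary, by contrast, computes quantiles on the client and cannot be aggregated across instances in the same way.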
PromQL (Prometheus Query Language)
- What is PromQL? Basic query syntax
- What is an instant vector vs range vector?
- Common PromQL functions:
- rate() - Calculate per-second rate
- irate() - Instant rate
- increase() - Total increase
- sum(), avg(), max(), min()
- by and without aggregations
- What is the difference between rate() and irate()?
- How to calculate percentiles in PromQL? (histogram_quantile)
- How to filter metrics by labels in PromQL?
- What is the up metric? How to use it for health checks?
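To make the difference between the rate, instant rate, and increase functions concrete, here is a simplified sketch of what each computes over the samples in a range vector. It is an approximation under stated assumptions: real Prometheus also extrapolates to the window boundaries, which is omitted here, and the sample data is invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def increase(samples: List[Sample]) -> float:
    """Total counter growth across the window, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a lower value means the process restarted,
        # so the whole current value counts as new growth.
        total += cur - prev if cur >= prev else cur
    return total


def rate(samples: List[Sample]) -> float:
    """Average per-second growth over the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0


def irate(samples: List[Sample]) -> float:
    """Per-second growth between only the last two samples."""
    return rate(samples[-2:])


# Hypothetical 60-second window scraped every 15 seconds, with a reset at t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 70)]
print(increase(window))
print(rate(window))
print(irate(window))
```

The averaged rate smooths over the window, which suits alerting and slow-moving graphs, while the instant rate reacts only to the most recent pair of samples and is therefore more volatile.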
Integration & Exporters
- What is an exporter in Prometheus?
- Common exporters:
- Node Exporter (system metrics)
- JMX Exporter (Java applications)
- Blackbox Exporter (endpoint monitoring)
- Custom exporters
- How to integrate Prometheus with Spring Boot? (Micrometer, Actuator)
- What is Pushgateway? When to use it?
- What is service discovery in Prometheus? (static config, DNS, Kubernetes, Consul)
Alerting
- What is Alertmanager? How does it work?
- How to configure alerts in Prometheus?
- What is alert routing and grouping?
- What is silencing in Alertmanager?
- What are best practices for alert thresholds?
Grafana (20 Questions)
Core Concepts
- What is Grafana? Key features
- What is the difference between Prometheus and Grafana?
- What are Grafana data sources? (Prometheus, InfluxDB, Elasticsearch, CloudWatch, MySQL)
- What is a dashboard in Grafana?
- What is a panel? Types of panels (Graph, Gauge, Table, Heatmap, Stat)
Dashboards & Visualization
- How to create a dashboard in Grafana?
- What are dashboard variables? Use cases
- What is templating in Grafana?
- How to create dynamic dashboards?
- What are dashboard annotations?
- What is the difference between absolute time and relative time ranges?
- How to share dashboards? (export JSON, snapshots, public dashboards)
Alerting
- How to configure alerts in Grafana?
- What are notification channels? (Email, Slack, PagerDuty, Webhook)
- What is alert state? (Pending, Alerting, OK, No Data)
- What is the difference between Grafana alerts and Prometheus alerts?
Advanced Features
- What is Grafana Loki? How is it different from Elasticsearch?
- What is Grafana Tempo? (distributed tracing)
- What are Grafana plugins?
- How to use Grafana with Kubernetes?
- What are best practices for dashboard design?
ELK Stack (Elasticsearch, Logstash, Kibana) (30 Questions)
Elasticsearch
- What is Elasticsearch? Core concepts
- What is an index in Elasticsearch?
- What is a document and document ID?
- What is a shard? Primary shard vs replica shard
- Why is sharding important in Elasticsearch?
- What is an inverted index?
- What is a mapping in Elasticsearch?
- What are analyzers? Common analyzers (Standard, Whitespace, Keyword, Pattern)
- What is the difference between text and keyword data types?
- How does Elasticsearch achieve near real-time search?
- What is a cluster, node, and index in Elasticsearch?
- What is the difference between GET and SEARCH API?
- What are query types in Elasticsearch?
- Match query
- Term query
- Range query
- Bool query
- Wildcard query
- What is aggregation in Elasticsearch? (Bucket, Metric, Pipeline)
- What is the difference between term query and match query?
- How to perform full-text search in Elasticsearch?
- What is scoring and relevance in Elasticsearch?
- How to optimize Elasticsearch performance?
- What is circuit breaker in Elasticsearch?
- What is index lifecycle management (ILM)?
Logstash
- What is Logstash? Architecture
- What are the three stages of Logstash pipeline? (Input, Filter, Output)
- Common Logstash input plugins (file, beats, kafka, jdbc)
- Common Logstash filter plugins (grok, mutate, date, json, geoip)
- Common Logstash output plugins (elasticsearch, file, kafka, stdout)
- What is Grok pattern? How to use it?
- What is the difference between Logstash and Filebeat?
- How to handle log parsing errors in Logstash?
Kibana
- What is Kibana? Key features
- What is Discover in Kibana?
- How to create visualizations in Kibana? (Bar, Line, Pie, Heatmap, Data Table)
- What is a Kibana dashboard?
- What are index patterns in Kibana?
- What is Kibana Query Language (KQL)?
- What is Kibana Lens?
- How to create alerts in Kibana?
- What is Canvas in Kibana?
ELK Stack Integration & Best Practices
- What is the typical ELK stack workflow?
- What are Beats? (Filebeat, Metricbeat, Packetbeat, Heartbeat, Auditbeat)
- When to use Logstash vs Filebeat?
- How to secure ELK stack? (authentication, encryption, role-based access)
- What are best practices for log management?
- How to handle large-scale log ingestion?
- What is hot-warm-cold architecture in Elasticsearch?
Apache Kafka (35 Questions)
Core Concepts
- What is Apache Kafka? Use cases
- What is the Kafka architecture? Components:
- Broker
- Topic
- Partition
- Producer
- Consumer
- ZooKeeper/KRaft
- What is a topic in Kafka?
- What is a partition? Why is partitioning important?
- What is a broker in Kafka?
- What is a Kafka cluster?
- What is the role of ZooKeeper in Kafka?
- What is KRaft mode? How does it differ from ZooKeeper-based clusters?
- What is a message/record in Kafka? (Key, Value, Timestamp, Headers)
Producers
- What is a Kafka producer?
- How does a producer send messages to Kafka?
- What is producer acknowledgment (acks)? (0, 1, all/-1)
- What is idempotent producer?
- What is the difference between sync and async send?
- What is partitioner? How does producer choose partition? (key-based, round-robin, custom)
- What is producer batching?
- What are producer configuration parameters?
- batch.size
- linger.ms
- compression.type
- max.in.flight.requests.per.connection
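For the partitioner question above, the sketch below illustrates key-based and round-robin partition selection. It is only an illustration: `SketchPartitioner` is a hypothetical class, and `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to keys. Newer Kafka clients also use a "sticky" strategy for keyless records rather than strict round-robin.

```python
import zlib
from typing import Optional


class SketchPartitioner:
    """Illustrative partition chooser; not Kafka's real implementation."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._next = 0  # cursor for keyless records

    def partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread records evenly across partitions.
            chosen = self._next
            self._next = (self._next + 1) % self.num_partitions
            return chosen
        # Same key always hashes to the same partition, preserving per-key order.
        return zlib.crc32(key) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
print(partitioner.partition(b"order-42"))
print([partitioner.partition(None) for _ in range(4)])
```

One consequence worth mentioning in an interview: because the partition depends on the partition count, adding partitions to a topic changes where existing keys land.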
Consumers
- What is a Kafka consumer?
- What is a consumer group?
- How does Kafka achieve load balancing among consumers?
- What is consumer offset?
- What is offset commit? Auto-commit vs manual commit
- What happens when a consumer fails? (rebalancing)
- What is consumer lag? How to monitor it?
- What is the difference between poll() and subscribe()?
- What is enable.auto.commit?
- What are consumer configuration parameters?
- group.id
- auto.offset.reset (earliest, latest, none)
- max.poll.records
- session.timeout.ms
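Consumer lag, asked about above, is the gap between the newest offset in each partition and the offset the consumer group has committed. A minimal sketch of that arithmetic follows; the function name and offset values are hypothetical, and treating a missing committed offset as 0 is an assumption (the real starting point depends on auto.offset.reset).

```python
from typing import Dict


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: records written but not yet committed by the group."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1100, 2: 400}

lag = consumer_lag(end_offsets, committed)
print(lag)                # per-partition lag
print(sum(lag.values()))  # total lag for the group
```

Watching the per-partition values rather than only the total helps distinguish a generally slow consumer group from a single stuck partition.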
Replication & Fault Tolerance
- What is replication in Kafka?
- What is replication factor?
- What is leader and follower?
- What is ISR (In-Sync Replica)?
- How does Kafka ensure message durability?
- What is min.insync.replicas?
- What happens when a broker fails?
- What is unclean leader election?
Performance & Scalability
- How does Kafka achieve high throughput?
- What is log compaction?
- What is retention policy in Kafka? (time-based, size-based)
- How to scale Kafka? (add brokers, increase partitions)
- What is the relationship between partitions and parallelism?
- What are Kafka performance tuning tips?
- What is the difference between Kafka and traditional message queues? (RabbitMQ, ActiveMQ)
Kafka Streams & Connect
- What is Kafka Streams?
- What is Kafka Connect?
- What are source and sink connectors?
- When to use Kafka Streams vs Kafka Connect?
Monitoring & Operations
- How to monitor Kafka? (JMX metrics, Kafka Manager, Burrow)
- Important Kafka metrics to monitor:
- UnderReplicatedPartitions
- OfflinePartitionsCount
- ActiveControllerCount
- RequestHandlerAvgIdlePercent
- What is Kafka MirrorMaker?
Redis (30 Questions)
Core Concepts
- What is Redis? Key features
- What makes Redis fast? (in-memory, single-threaded, efficient data structures)
- What are Redis data types?
- String
- List
- Set
- Sorted Set
- Hash
- Bitmap
- HyperLogLog
- Stream
- What are common Redis use cases? (caching, session storage, rate limiting, real-time analytics)
- What is the difference between Redis and Memcached?
- Is Redis single-threaded? How does it handle concurrent requests?
Data Structures & Commands
- Important String commands (SET, GET, INCR, DECR, MSET, MGET)
- Important List commands (LPUSH, RPUSH, LPOP, RPOP, LRANGE)
- Important Set commands (SADD, SMEMBERS, SINTER, SUNION, SDIFF)
- Important Sorted Set commands (ZADD, ZRANGE, ZRANK, ZINCRBY)
- Important Hash commands (HSET, HGET, HGETALL, HINCRBY)
- What is the time complexity of common Redis operations?
- What is SCAN command? Difference from KEYS
- What are Redis transactions? (MULTI, EXEC, DISCARD, WATCH)
Persistence
- What are Redis persistence mechanisms?
- RDB (Redis Database, point-in-time snapshots)
- AOF (Append-Only File)
- What is the difference between RDB and AOF?
- When to use RDB vs AOF?
- What is hybrid persistence (RDB+AOF)?
- What is snapshotting in Redis?
Caching Strategies
- What are caching strategies?
- Cache-Aside (Lazy Loading)
- Write-Through
- Write-Behind (Write-Back)
- Read-Through
- What is cache eviction policy? (LRU, LFU, FIFO, Random, TTL)
- What is TTL (Time To Live)?
- How to handle cache stampede?
- What is cache penetration, cache breakdown, and cache avalanche?
- How to implement distributed locking in Redis? (SET with NX and an expiry, Redlock algorithm)
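Of the caching strategies listed in this section, cache-aside is usually the first one interviewers ask candidates to sketch. The example below shows the read path with a TTL. `TtlCache` is a hypothetical in-memory stand-in for a Redis client, and `get_user` and `load_from_db` are illustrative names, so treat this as a sketch of the pattern rather than production code.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TtlCache:
    """In-memory stand-in for a cache server: get/set with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[Any, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_user(user_id: int, cache: TtlCache,
             load_from_db: Callable[[int], Any], ttl_seconds: float = 60.0) -> Any:
    """Cache-aside read: check the cache, fall back to the database, then populate."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                    # cache hit
    value = load_from_db(user_id)        # cache miss: go to the source of truth
    cache.set(key, value, ttl_seconds)   # populate for later readers
    return value
```

Note that when many requests miss the same key at once, each one calls the loader. That is the cache stampede problem raised above, and it is typically mitigated with locking, request coalescing, or staggered TTLs.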
High Availability & Scalability
- What is Redis Sentinel? How does it work?
- What is Redis Cluster? How does it achieve scalability?
- What is the difference between Redis Sentinel and Redis Cluster?
- How does Redis Cluster handle data sharding?
- What is hash slot in Redis Cluster?
- What is split-brain problem in Redis?
- What is Redis replication? Master-slave architecture
- How to handle failover in Redis?
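For the hash slot question above: Redis Cluster maps every key to one of 16384 slots using CRC16 of the key modulo 16384, and hashes only the text between the first `{` and the next `}` when that hash tag is non-empty. The sketch below follows that rule. It assumes that Python's `binascii.crc_hqx` with an initial value of 0 matches the CRC16 variant used by Redis, and the function name is hypothetical.

```python
import binascii

SLOT_COUNT = 16384


def key_hash_slot(key: bytes) -> int:
    """Compute the cluster slot for a key, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the braces replaces the full key.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


print(key_hash_slot(b"foo"))
print(key_hash_slot(b"{user:1000}.following") == key_hash_slot(b"{user:1000}.followers"))
```

Hash tags are what make multi-key operations possible in a cluster: keys sharing a tag land in the same slot and therefore on the same node.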
Performance & Monitoring
- How to monitor Redis? (INFO command, redis-cli, monitoring tools)
- Important Redis metrics:
- Memory usage
- Hit rate
- Connected clients
- Commands processed per second
- Evicted keys
- How to optimize Redis performance?
- What is pipelining in Redis?
- What is Redis pub/sub?
- What are Redis Streams? Use cases
- What is the maximum size of a Redis key/value?
CDN (Content Delivery Network) (20 Questions)
Core Concepts
- What is CDN? How does it work?
- What are the benefits of using CDN? (reduced latency, improved performance, DDoS protection, reduced bandwidth cost)
- What is edge server/edge location?
- What is origin server?
- What is Point of Presence (PoP)?
- How does CDN routing work? (DNS-based routing, Anycast)
CDN Types & Architecture
- What are types of CDN?
- Push CDN
- Pull CDN
- What is the difference between push and pull CDN?
- What is CDN caching? Cache hierarchy
- What is cache hit ratio?
- What is Time To Live (TTL) in CDN?
- What is cache invalidation/purging?
CDN Features
- What is edge computing?
- What is CDN load balancing?
- How does CDN handle dynamic content?
- What is CDN SSL/TLS termination?
- What is geo-blocking in CDN?
- What are CDN security features? (DDoS protection, WAF, bot protection)
- What is image optimization in CDN?
- What is compression in CDN? (Gzip, Brotli)
Popular CDN Providers
- Popular CDN providers (Cloudflare, Akamai, AWS CloudFront, Fastly, Azure CDN)
- What is Cloudflare? Key features
- What is AWS CloudFront?
- How to integrate CDN with your application?
Spring Boot Actuator (25 Questions)
Core Concepts
- What is Spring Boot Actuator?
- How to enable Actuator in Spring Boot?
- What are Actuator endpoints?
- What is the difference between web endpoints and JMX endpoints?
Built-in Endpoints
- Important Actuator endpoints:
- /actuator/health - Application health
- /actuator/info - Application information
- /actuator/metrics - Application metrics
- /actuator/env - Environment properties
- /actuator/beans - Spring beans
- /actuator/mappings - Request mappings
- /actuator/loggers - Logger configuration
- /actuator/threaddump - Thread dump
- /actuator/heapdump - Heap dump
- /actuator/prometheus - Prometheus metrics
- What is health indicator? (disk space, database, Redis, custom)
- How to create custom health indicators?
- What is health status? (UP, DOWN, OUT_OF_SERVICE, UNKNOWN)
- How to expose/hide specific endpoints?
- What is the /info endpoint? How to add custom info?
Metrics
- What metrics are available in Actuator?
- JVM metrics (memory, threads, GC)
- HTTP metrics (request count, response time)
- Database metrics (connection pool)
- Custom metrics
- How to create custom metrics? (MeterRegistry, Counter, Gauge, Timer)
- What is Micrometer?
- How to integrate Actuator with Prometheus?
- What is dimensional metrics?
Security & Configuration
- How to secure Actuator endpoints?
- What is the role of Spring Security with Actuator?
- How to configure endpoint exposure? (management.endpoints.web.exposure.include)
- What is base path for Actuator? (management.endpoints.web.base-path)
- How to customize Actuator endpoints?
- What is @Endpoint annotation?
Advanced Features
- How to create custom Actuator endpoints?
- What is /auditevents endpoint?
- How to monitor application performance using Actuator?
- How to integrate Actuator with external monitoring systems? (Grafana, Prometheus, ELK)
Application Performance Monitoring (APM) (20 Questions)
Core Concepts
- What is APM? Why is it important?
- What is distributed tracing?
- What is a trace, span, and trace ID?
- What is observability vs monitoring?
- Three pillars of observability (Logs, Metrics, Traces)
APM Tools
- Popular APM tools:
- New Relic
- Datadog
- AppDynamics
- Dynatrace
- Elastic APM
- Jaeger
- Zipkin
- What is Zipkin? How does it work?
- What is Jaeger? Architecture
- What is Spring Cloud Sleuth?
- How to implement distributed tracing in Spring Boot? (Sleuth + Zipkin)
Metrics & Monitoring
- What is Apdex score?
- What is response time percentiles? (p50, p95, p99)
- What is throughput and latency?
- What is error rate?
- How to monitor database query performance?
- How to identify performance bottlenecks?
- What is transaction tracing?
- What is Real User Monitoring (RUM)?
- What is Synthetic Monitoring?
- What is the difference between RUM and Synthetic Monitoring?
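Two of the metrics in this section, the Apdex score and response time percentiles, are simple to compute by hand, and working through them is a common interview exercise. The sketch below uses the standard Apdex definition (satisfied requests finish within a threshold T, tolerating requests within 4T) and a nearest-rank percentile. The function names and sample latencies are made up for illustration.

```python
import math
from typing import Sequence


def apdex(response_times: Sequence[float], threshold: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in seconds.
latencies = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.2, 1.8, 2.5, 5.0]
print(apdex(latencies, threshold=0.5))
print(percentile(latencies, 50), percentile(latencies, 95))
```

In this sample the median looks acceptable while the p95 is dominated by a single slow request, which is why tail percentiles are preferred over averages for latency objectives.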
API Monitoring & Management (20 Questions)
API Monitoring
- What is API monitoring? Why is it important?
- What metrics should be monitored for APIs?
- Response time
- Error rate
- Request rate
- Availability/Uptime
- Latency
- How to monitor API endpoints? (health checks, synthetic monitoring)
- What is API uptime monitoring?
- What are API monitoring tools? (Postman, Runscope, Pingdom, Uptime Robot)
- How to implement API health checks in Spring Boot?
API Gateway Monitoring
- What is API Gateway?
- What metrics to monitor in API Gateway?
- How to monitor API Gateway performance?
- What is rate limiting in API Gateway?
- How to implement throttling?
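For the rate limiting and throttling questions above, one widely used approach is the token bucket. The sketch below is a single-process illustration of that algorithm, not how any particular gateway implements it. The class name, its parameters, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller would typically answer with HTTP 429


bucket = TokenBucket(rate_per_sec=5, capacity=10)
print([bucket.allow() for _ in range(12)])
```

A gateway serving many instances would keep the bucket state in a shared store instead of process memory, and would track the number of rejected requests as one of its rate limiting metrics.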
API Logging & Analytics
- What should be logged for APIs?
- Request/Response
- Headers
- Timestamps
- User information
- Errors
- How to implement structured logging for APIs?
- What is API analytics?
- How to track API usage patterns?
- What is request tracing?
- How to correlate logs across microservices? (correlation ID)
API Security Monitoring
- How to monitor API security?
- What are common API security threats? (DDoS, SQL injection, unauthorized access)
- How to detect API abuse?
- What is anomaly detection in API monitoring?
Log Management & Best Practices (20 Questions)
Logging Fundamentals
- What are log levels? (TRACE, DEBUG, INFO, WARN, ERROR, FATAL)
- When to use each log level?
- What is structured logging?
- What is the difference between structured and unstructured logs?
- What is JSON logging? Benefits
- What should be included in log messages?
- Timestamp
- Log level
- Service name
- Correlation ID
- User ID
- Error details
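The fields listed above map naturally onto a structured JSON log line. The following sketch shows one possible way to emit them with the standard library, carrying the correlation ID in a context variable. The service name, IDs, and helper names are hypothetical, and the example writes to an in-memory buffer only so that the result can be inspected.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Set once per request (for example from an incoming HTTP header).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service_name: str) -> None:
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service_name,
            "correlation_id": correlation_id.get(),
            "user_id": getattr(record, "user_id", None),  # supplied via extra=
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(service_name: str, stream) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service_name))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
logger = build_logger("order-service", buffer)
correlation_id.set("req-7f3a")  # hypothetical request ID
logger.info("order created", extra={"user_id": "u-123"})
entry = json.loads(buffer.getvalue())
print(entry)
```

In a real service the stream would be stdout or a file picked up by a log shipper, and the same correlation ID would be forwarded on outgoing calls so that log lines from different services can be joined.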
Logging Frameworks
- Popular Java logging frameworks:
- Logback
- Log4j2
- SLF4J (Simple Logging Facade)
- What is SLF4J? Why use it?
- What is the difference between Log4j, Log4j2, and Logback?
- How to configure logging in Spring Boot?
- What is logging pattern/layout?
Log Aggregation
- What is log aggregation? Why is it important?
- What is centralized logging?
- How to implement centralized logging in microservices?
- What is log retention policy?
- How to handle log rotation?
- What is log sampling?
Best Practices
- What are logging best practices?
- Use appropriate log levels
- Include context information
- Avoid logging sensitive data
- Use correlation IDs
- Implement log sampling for high-volume systems
- How to avoid logging sensitive information? (passwords, credit cards, PII)
- How to optimize log storage costs?
- What is log enrichment?
- How to search and analyze logs efficiently?
Alerting & Incident Management (20 Questions)
Alerting Fundamentals
- What is alerting? Why is it important?
- What makes a good alert?
- What is alert fatigue? How to prevent it?
- What is the difference between alert and notification?
- What are alert severity levels? (Critical, High, Medium, Low)
Alert Types
- What are different types of alerts?
- Threshold-based alerts
- Anomaly detection alerts
- Composite alerts
- What is threshold alerting?
- What is anomaly-based alerting?
- What is alert aggregation?
- What is alert deduplication?
Alert Configuration
- What factors to consider when setting alert thresholds?
- What is alert hysteresis?
- What is alert flapping? How to prevent it?
- What is alert routing?
- What is on-call rotation?
- How to prioritize alerts?
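Alert hysteresis and flapping, both raised in this section, are easiest to explain with two thresholds: the alert fires above one value and clears only below a lower one. The sketch below illustrates that idea; the class name, thresholds, and readings are invented for the example.

```python
from typing import List


class HysteresisAlert:
    """Fires above `fire_above` and clears only once the value drops below `clear_below`."""

    def __init__(self, fire_above: float, clear_below: float) -> None:
        if clear_below > fire_above:
            raise ValueError("clear_below must not exceed fire_above")
        self.fire_above = fire_above
        self.clear_below = clear_below
        self.firing = False

    def update(self, value: float) -> bool:
        if not self.firing and value > self.fire_above:
            self.firing = True
        elif self.firing and value < self.clear_below:
            self.firing = False
        return self.firing


# Hypothetical CPU utilization readings hovering around a 90% threshold.
readings: List[float] = [85, 91, 89, 92, 88, 79, 85]

alert = HysteresisAlert(fire_above=90, clear_below=80)
with_hysteresis = [alert.update(v) for v in readings]
plain_threshold = [v > 90 for v in readings]

print(with_hysteresis)
print(plain_threshold)
```

With the gap between the two thresholds, the alert changes state twice for these readings, whereas the single threshold changes state four times. Requiring a condition to hold for a minimum duration is another common way to reduce flapping.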
Incident Management
- What is incident management process?
- What is incident severity classification?
- What is MTTR (Mean Time To Repair)?
- What is MTTD (Mean Time To Detect)?
- What is MTTA (Mean Time To Acknowledge)?
- What are incident management tools? (PagerDuty, Opsgenie, VictorOps)
- What is incident postmortem? Why is it important?
- What is runbook/playbook?
Scenario-Based Questions (40 Questions)
Performance Issues
- Your application response time suddenly increased. How would you troubleshoot?
- How would you identify if the issue is in application, database, or network?
- CPU usage is at 100%. How would you investigate?
- Memory usage is continuously growing. How would you detect memory leaks?
- Database queries are slow. How would you optimize?
- How would you handle a sudden spike in traffic?
- Application is timing out. How would you debug?
Monitoring & Observability
- How would you set up monitoring for a new microservice?
- What metrics would you monitor for a REST API?
- How would you monitor database performance?
- How would you implement distributed tracing across 10 microservices?
- How would you correlate logs across multiple services?
- How would you monitor Kafka consumer lag?
- How would you detect if a microservice is down?
- How would you monitor Redis cache hit rate?
Alerting & Incident Response
- You received an alert about high error rate. What steps would you take?
- How would you configure alerts to avoid false positives?
- Multiple alerts are firing. How would you prioritize?
- A critical service is down at 3 AM. Walk through your incident response process
- How would you implement on-call rotation for your team?
- How would you conduct a postmortem after an incident?
ELK Stack Scenarios
- Elasticsearch cluster is slow. How would you optimize?
- How would you handle log ingestion of 1TB/day?
- Elasticsearch nodes are running out of memory. What would you do?
- How would you search for specific error messages across millions of logs?
- How would you implement log retention policy for cost optimization?
- Kibana dashboards are loading slowly. How would you troubleshoot?
Kafka Scenarios
- Kafka consumer lag is increasing. How would you address it?
- A Kafka broker went down. What happens?
- How would you handle Kafka rebalancing issues?
- Messages are being duplicated. How would you ensure exactly-once delivery?
- How would you migrate Kafka cluster without downtime?
- How would you scale Kafka to handle 10x traffic?
Redis Scenarios
- Redis is running out of memory. What would you do?
- Cache hit rate is very low. How would you improve it?
- How would you handle cache stampede during peak traffic?
- Redis master went down. How does failover work?
- How would you implement rate limiting using Redis?
- How would you migrate from single Redis instance to Redis Cluster?
Prometheus & Grafana Scenarios
- Prometheus is consuming too much storage. How would you optimize?
- How would you monitor multiple Kubernetes clusters with Prometheus?
- Grafana dashboard is not showing recent data. What could be wrong?
- How would you create a dashboard for database performance monitoring?
- How would you set up alerts for API latency > 500ms?
CDN & API Gateway Scenarios
- CDN cache hit rate is low. How would you improve it?
- Static assets are not being cached. How would you debug?
- How would you handle CDN cache invalidation for a critical update?
- API Gateway is becoming a bottleneck. How would you scale?
- How would you implement rate limiting at API Gateway level?
System Design with Monitoring
- Design a monitoring system for an e-commerce application
- How would you monitor a microservices-based system with 50+ services?
- Design an alerting strategy for a payment processing system
- How would you implement observability in a serverless architecture?
- Design a logging strategy for a multi-region deployment
Best Practices & Guidelines (25 Questions)
Monitoring Best Practices
- What are the key principles of effective monitoring?
- Monitor what matters
- Keep it simple
- Avoid alert fatigue
- Use meaningful metrics
- What is the USE method? (Utilization, Saturation, Errors)
- What is the RED method? (Rate, Errors, Duration)
- What metrics should be monitored at different layers?
- Application layer
- Infrastructure layer
- Network layer
- Database layer
- How to establish SLO (Service Level Objectives)?
- What is the difference between SLI, SLO, and SLA?
Logging Best Practices
- What are logging best practices in microservices?
- How to implement correlation across distributed systems?
- What should never be logged? (passwords, tokens, PII, credit cards)
- How to balance between detailed logging and performance?
- What is the cost of excessive logging?
Alerting Best Practices
- What makes an actionable alert?
- How many alerts are too many?
- What is alert-to-noise ratio?
- Should you alert on symptoms or causes?
- What is the difference between alerts and notifications?
Performance Best Practices
- What are performance monitoring best practices?
- How to establish baseline metrics?
- What is capacity planning?
- How to perform load testing with monitoring?
- What is chaos engineering? How does monitoring help?
Security & Compliance
- How to ensure sensitive data is not exposed in logs?
- What are compliance requirements for log retention? (GDPR, HIPAA)
- How to implement audit logging?
- How to secure monitoring endpoints?
- What access controls should be in place for monitoring systems?
Tools Comparison (10 Questions)
- Prometheus vs InfluxDB - When to use which?
- Prometheus: Pull-based, optimized for metrics/monitoring, strong alerting, better for Kubernetes, PromQL
- InfluxDB: Push-based, general-purpose time-series DB, better for IoT/sensor data, InfluxQL/Flux, built-in data retention
- Use Prometheus for infrastructure monitoring, InfluxDB for application analytics
- ELK Stack vs Splunk - Pros and cons
- ELK Stack:
- Pros: Open-source, cost-effective, flexible, large community
- Cons: Complex setup, resource-intensive, requires maintenance
- Splunk:
- Pros: Enterprise features, powerful analytics, better support, easier setup
- Cons: Expensive licensing, cost scales with data volume
- Use ELK for cost-sensitive projects, Splunk for enterprise with budget
- Grafana vs Kibana - Key differences
- Grafana: Multi-source visualization, better for metrics/time-series, cleaner dashboards, alerting
- Kibana: Tightly integrated with Elasticsearch, better for logs, built-in analytics, Elastic ecosystem
- Use Grafana for metrics dashboards, Kibana for log analysis
- Kafka vs RabbitMQ - Use cases
- Kafka:
- High throughput, distributed streaming, log aggregation, event sourcing
- Durable, replay capability, horizontal scaling
- RabbitMQ:
- Traditional message queue, complex routing, low latency, easier setup
- Better for request-reply patterns
- Use Kafka for event streaming/big data, RabbitMQ for traditional messaging
- Redis vs Memcached - Key differences
- Redis:
- Multiple data structures, persistence, pub/sub, clustering, Lua scripting
- Single-threaded command execution, feature-rich
- Memcached:
- Simple key-value only, multi-threaded, no persistence
- Slightly faster for simple caching
- Use Redis for complex use cases, Memcached for simple distributed caching
- Logstash vs Fluentd - Comparison
- Logstash:
- Elastic ecosystem, rich plugins, Grok patterns, Java-based (resource heavy)
- Fluentd:
- Lightweight (Ruby/C), better performance, Cloud Native, CNCF project
- JSON native, unified logging layer
- Use Logstash with ELK Stack, Fluentd for cloud-native/Kubernetes
- Jaeger vs Zipkin - Distributed tracing comparison
- Jaeger:
- Uber-developed, CNCF project, better for Kubernetes
- Adaptive sampling, hot-path support
- Zipkin:
- Twitter-developed, simpler setup, more mature
- Better documentation, wider adoption
- Both are good choices; choose based on ecosystem fit
- New Relic vs Datadog - APM comparison
- New Relic:
- Strong APM focus, easier learning curve, better for application monitoring
- Usage-based pricing (per user plus data ingested)
- Datadog:
- Better infrastructure monitoring, more integrations, real-time analytics
- Per-host pricing with extra charges for custom metrics, can be expensive
- Choose based on primary use case (application vs infrastructure focus)
- CloudWatch vs Prometheus for AWS - When to use which?
- CloudWatch:
- Native AWS integration, no setup required, managed service
- Limited retention, AWS-specific
- Prometheus:
- Open-source, flexible, better query language, cross-cloud
- Self-managed, requires setup
- Use CloudWatch for AWS-only, Prometheus for multi-cloud/detailed metrics
- Sentry vs ELK for error tracking - Comparison
- Sentry:
- Specialized error tracking, better error grouping, release tracking
- Developer-friendly, issue assignment
- ELK:
- General-purpose logging, full-text search, broader use cases
- More complex but more flexible
- Use Sentry for application error tracking, ELK for comprehensive logging
Additional Advanced Topics (15 Questions)
Observability as Code
- What is Observability as Code?
- Defining monitoring, logging, and alerting configuration as code
- Version control, peer review, automated deployment
- Infrastructure as Code for observability
- What are benefits of GitOps for monitoring?
- Version control for dashboards and alerts
- Reproducible environments
- Easy rollback and audit trail
Service Mesh Observability
- What is service mesh? How does it help observability?
- Istio, Linkerd, Consul Connect
- Automatic distributed tracing
- Standardized metrics collection
- Traffic visibility without code changes
- What metrics does service mesh provide?
- Request success rates
- Latency distribution
- Service dependencies
- Circuit breaker stats
Cost Optimization
- How to optimize monitoring costs?
- Sampling high-volume metrics
- Data retention policies
- Log level filtering
- Metric aggregation
- Use tiered storage (hot/warm/cold)
- What is metric cardinality explosion? How to prevent it?
- Too many unique label combinations
- Increases storage and query costs
- Prevention: Limit label values, avoid unbounded labels, use label guidelines
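The cardinality explosion described above follows from simple multiplication: in the worst case, one metric produces a time series for every combination of label values. The sketch below computes that upper bound. The label names and counts are hypothetical, and real series counts are often lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical labels on an HTTP request counter.
bounded = {"method": 5, "status": 10, "endpoint": 50}
print(max_series(bounded))

# Adding a single unbounded label multiplies everything that came before it.
with_user_id = {**bounded, "user_id": 100_000}
print(max_series(with_user_id))
```

This is why identifiers such as user IDs, request IDs, or raw URLs belong in logs or traces rather than in metric labels.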
Modern Observability Patterns
- What is OpenTelemetry?
- Vendor-neutral observability framework
- Unified APIs for traces, metrics, logs
- Auto-instrumentation support
- CNCF project
- What is eBPF in observability?
- Extended Berkeley Packet Filter
- Kernel-level observability without agents
- Low overhead monitoring
- Tools: Pixie, Cilium
- What is continuous profiling?
- Always-on performance profiling
- Production-safe profiling
- Identify performance regressions
- Tools: Pyroscope, Parca
SRE & Reliability
- What is SRE (Site Reliability Engineering)?
- Applies software engineering to operations
- Focus on reliability, scalability, automation
- Error budgets and SLOs
- What is error budget?
- Acceptable downtime based on SLO
- Balance between reliability and feature velocity
- Example: a 99.9% uptime SLO allows about 43 minutes of downtime per 30-day month (0.1% of 43,200 minutes)
- What are the four golden signals of SRE?
- Latency: Time to serve requests
- Traffic: Demand on system
- Errors: Rate of failed requests
- Saturation: Resource utilization
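The error budget figure quoted in this section (99.9% uptime, roughly 43 minutes per month) can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day window; the function names and the downtime figure in the usage example are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


print(round(error_budget_minutes(99.9), 1))   # three nines
print(round(error_budget_minutes(99.99), 2))  # four nines
print(round(budget_remaining(99.9, downtime_minutes=10.8), 2))
```

Each additional nine cuts the budget by a factor of ten, which is the basis for trading reliability work against feature velocity.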
Cloud-Native Monitoring
- How to monitor Kubernetes clusters?
- Prometheus Operator
- kube-state-metrics
- Node exporter
- cAdvisor for container metrics
- Grafana dashboards
- What is container monitoring? Key metrics
- CPU and memory usage per container
- Container restart count
- Network I/O
- Disk I/O
- Tools: cAdvisor, Datadog, New Relic
- How to monitor serverless applications?
- Cold start duration
- Invocation count and errors
- Duration and memory usage
- CloudWatch for AWS Lambda
- Distributed tracing challenges
Real-World Integration Patterns (10 Questions)
- How to integrate Prometheus with Spring Boot microservices?
- Add Micrometer dependency
- Enable Actuator with Prometheus endpoint
- Configure Prometheus scraping
- Create Grafana dashboards
- How to set up centralized logging for microservices?
- Filebeat on each service → Logstash → Elasticsearch → Kibana
- Add correlation ID to all logs
- Structured JSON logging
- Log aggregation pattern
- How to implement health checks across microservices?
- Liveness probes (is service running?)
- Readiness probes (can service handle traffic?)
- Custom health indicators
- Aggregate health status
- How to monitor API Gateway (Kong/AWS API Gateway)?
- Request/response metrics
- Rate limiting metrics
- Authentication success/failure
- Backend service health
- Integration with Prometheus/CloudWatch
- How to integrate Kafka with monitoring systems?
- JMX Exporter for Prometheus
- Consumer lag monitoring (Burrow)
- Kafka Manager/AKHQ for UI
- Alert on lag, under-replicated partitions
- How to monitor database connections in Spring Boot?
- HikariCP metrics via Actuator
- Monitor active, idle, pending connections
- Connection pool saturation alerts
- Query performance with slow query logs
- How to implement circuit breaker monitoring?
- Resilience4j with Micrometer
- Monitor state transitions (closed/open/half-open)
- Success/failure rates
- Visualize in Grafana
- How to trace requests across API Gateway → Microservices → Database?
- Spring Cloud Sleuth for trace ID generation
- Propagate trace context in HTTP headers
- Zipkin/Jaeger for trace collection
- Visualize complete request flow
- How to implement custom business metrics?
- MeterRegistry in Spring Boot
- Counter for events (orders, signups)
- Timer for operations
- Gauge for current state
- Export to Prometheus
- How to monitor scheduled jobs/batch processes?
- Job execution time
- Success/failure rate
- Last successful run timestamp
- Dead letter queue monitoring
- Alert on job failures
Troubleshooting Checklist (10 Questions)
- Application is slow - Where to start?
- Check application metrics (response time, throughput)
- Review recent deployments/changes
- Check resource utilization (CPU, memory, disk)
- Analyze slow queries in database
- Check external service dependencies
- Review logs for errors/warnings
- High memory usage - Investigation steps
- Take heap dump (jmap, Actuator)
- Analyze with tools (MAT, VisualVM)
- Check for memory leaks
- Review garbage collection logs
- Check cache sizes
- Monitor memory growth over time
- Database queries timing out - Debug approach
- Enable slow query log
- Check query execution plans (EXPLAIN)
- Look for missing indexes
- Check database connection pool
- Review lock contention
- Check database resource utilization
- Microservice not responding - Troubleshooting
- Check health endpoint
- Review application logs
- Check resource limits (CPU, memory)
- Verify network connectivity
- Check dependent services
- Review recent deployments
- Redis cache misses increasing - Investigation
- Check cache hit/miss ratio
- Verify TTL settings
- Check memory usage and eviction
- Review cache key patterns
- Look for cache invalidation issues
- Check client connection issues
- Kafka consumer lag growing - Resolution steps
- Check consumer processing time
- Verify partition assignment
- Scale consumers (add instances)
- Optimize consumer batch size
- Check for slow downstream dependencies
- Review consumer configuration
- Elasticsearch cluster yellow/red status - Fix
- Check unassigned shards
- Verify replica settings
- Check disk space on nodes
- Review cluster allocation settings
- Check for node failures
- Rebalance shards if needed
- Prometheus scraping failures - Troubleshooting
- Verify target is reachable
- Check firewall/network rules
- Verify metrics endpoint is exposed
- Check Prometheus logs
- Verify service discovery config
- Test metrics endpoint manually
- Grafana dashboard not updating - Debug
- Check data source connection
- Verify time range selection
- Check query syntax
- Review Prometheus/data source availability
- Check dashboard refresh settings
- Look for query errors in browser console
- API Gateway returning 5xx errors - Investigation
- Check gateway logs
- Verify backend service health
- Check timeout configurations
- Review rate limiting rules
- Check authentication/authorization
- Verify routing configuration
Interview Preparation Tips
Common Interview Patterns
Pattern 1: Troubleshooting Scenarios
- Always start with data/metrics
- Follow systematic approach
- Consider recent changes
- Think about dependencies
- Propose monitoring improvements
Pattern 2: System Design Questions
- Define requirements first
- Consider scale and load
- Plan for failure scenarios
- Include monitoring from start
- Discuss trade-offs
Pattern 3: Tool Selection
- Understand use case
- Consider scale and cost
- Think about team expertise
- Integration with existing tools
- Open-source vs commercial
Key Concepts to Master
- Metrics Collection: Pull vs Push, sampling, cardinality
- Log Aggregation: Centralization, parsing, storage, retention
- Distributed Tracing: Correlation, context propagation, sampling
- Alerting: Thresholds, alert fatigue, actionable alerts
- Scalability: Horizontal scaling, partitioning, caching
- High Availability: Replication, failover, disaster recovery
Quick Reference Metrics
Application:
- Response time (p50, p95, p99)
- Request rate (req/sec)
- Error rate (%)
- Active users/connections
Infrastructure:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O (IOPS, throughput)
- Network I/O (bytes/sec)
Database:
- Query execution time
- Connection pool usage
- Slow query count
- Replication lag
Cache (Redis):
- Hit rate (%)
- Memory usage
- Evicted keys
- Connected clients
Message Queue (Kafka):
- Consumer lag
- Message rate
- Under-replicated partitions
- Broker availability